{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Signatures and duplicates selection\n",
    "\n",
    "(c) 2019, Dr. Ramil Nugmanov; Dr. Timur Madzhidov; Ravil Mukhametgaleev\n",
    "\n",
    "Installation instructions of CGRtools package information and tutorial's files see on `https://github.com/cimm-kzn/CGRtools`\n",
    "\n",
    "NOTE: Tutorial should be performed sequentially from the start. Random cell running will lead to unexpected results. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pkg_resources\n",
    "if pkg_resources.get_distribution('CGRtools').version.split('.')[:2] != ['3', '1']:\n",
    "    print('WARNING. Tutorial was tested on 3.1 version of CGRtools')\n",
    "else:\n",
    "    print('Welcome!')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# load data for tutorial\n",
    "from pickle import load\n",
    "from traceback import format_exc\n",
    "\n",
    "with open('molecules.dat', 'rb') as f:\n",
    "    molecules = load(f) # list of MoleculeContainer objects\n",
    "with open('reactions.dat', 'rb') as f:\n",
    "    reactions = load(f) # list of ReactionContainer objects\n",
    "\n",
    "m1, m2, m3 = molecules[:3] # molecule\n",
    "m7 = m3.copy()\n",
    "m11 = m3.copy()\n",
    "m11.standardize()\n",
    "m7.standardize()\n",
    "r1 = reactions[0] # reaction\n",
    "m1.reset_query_marks()\n",
    "m1.flush_cache()\n",
    "m1.delete_atom(3) \n",
    "cgr2 = ~r1  \n",
    "cgr2.reset_query_marks()\n",
    "m3.reset_query_marks()\n",
    "proj = m3.substructure([4,5,6,7,8,9])\n",
    "benzene = benzene = m3.substructure([4,5,6,7,8,9], as_view=False) \n",
    "benzene.reset_query_marks()\n",
    "proj_copy = proj.copy()\n",
    "proj_copy.reset_query_marks() \n",
    "m3.delete_bond(4, 5) \n",
    "proj.flush_cache()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.1. Molecule Signatures\n",
    "*MoleculeContainer* has methods for unique molecule signature generation.\n",
    "Signature is SMILES string with explicit bonds notation and canonical atoms ordering. For pyroles, signatures does not comply with the SMILES rules.\n",
    "\n",
    "For signature generation one need to call `str` function on MoleculeContainer object.  \n",
    "Fixed length hash of signature could be retrieved by calling `bytes` function on molecule (correspond to SHA 512 bitstring).\n",
    "\n",
    "Order of atoms calculated by Morgan-like algorithm. On initial state for each atoms it's integer code calculated based on its type. All bonds incident to atoms also coded as integers and stored in sorted tuple. Atom code and tuple of it's bonds used for ordering and similar atoms detecting. Ordered atoms rank is replaced with prime numbers from a prime number lookup table. Atoms of the same type with the same bonds types incident to it have equal prime numbers.\n",
    "\n",
    "Prime numbers codes found are used in Morgan algorithm cycle.\n",
    "\n",
    "On each loop for each atom square of its prime number is multiplied to neighboring atoms prime numbers, observed numbers for atoms are ranked and prime numbers are again assigned. Loop is repeated until all atoms will be unique or number of unique atoms will not change in 3 subsequent loops."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ms2 = str(m2)  # get and print signature\n",
    "print(ms2)  \n",
    "# or \n",
    "print(m2)\n",
    "\n",
    "hms2 = bytes(m2)  # get sha512 hash of signature as bytes-string"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "String formatting is supported that is useful for reporting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f'f string {m2}')  # use signature in string formatting\n",
    "print('C-style string %s' % m2)\n",
    "print('format method {}'.format(m2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Number of neighbors and hybridization could be added to signature. Note that in this case they are not readable as SMILES.\n",
    "\n",
    "For MoleculeContainer and ReactionContainer query marks are not included in signatures and not printed by str and print function. However they could be printed in the following way:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "m2.reset_query_marks() # calculate hybridization and number of neighbors\n",
    "print(f'{m2:hn}')  # get signatures with hybridization and neighbors data\n",
    "print('{:h}'.format(m2))  # get signature with hybridization only data\n",
    "# h - hybridization marks, n- neighbors marks\n",
    "format(m2, 'n') # include only number of neighbors in signature\n",
    "print(m2)   # notice that neighbors and hybridization are hided, since signatures for molecules does not contain this information."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Atoms in the signature are represented in the following way: [element_symbol;hn;charge;multiplicity (if not None)]. h mean hybridization, n - number of neighbors. Notation for hybridization is the following:\n",
    "\n",
    "    s - all bonds of atom are single\n",
    "    d - atom has one double bond and others are single\n",
    "    t - atom has one triple or two double bonds and other are single\n",
    "    a - atom is in aromatic ring\n",
    "\n",
    "Examples: s1 - atom has s hybridization and one neighbor\n",
    "d3 - atom has d hybridization and 3 neighbors\n",
    "\n",
    "Signatures for QueryContainer and QueryCGRContainer include query marks and are printed by str and print function.\n",
    "\n",
    "Signatures of projections also supported.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f'{proj:h}'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Molecules comparable and hashable\n",
    "\n",
    "Comparison of MoleculeContainer is based on its signatures. Moreover, since strings in Python are hashable, MoleculeContaier also hashable.\n",
    "\n",
    "NOTE: MoleculeContainer can be changed. This can lead to unobvious behavior of the sets and dictionaries in which these molecules were placed before the change. Avoid changing molecules (standardize, aromatize, hydrogens and atoms/bonds changes) placed inside sets and dictionaries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "m1 != m2 # different molecules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "m7 == m11 # copy of the same molecule"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "m7 is m11  # this is not same objects!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "benzene == proj_copy   # projection extracted benzene structure from molecule and then was transformed in MoleculeContainer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simplest way to exclude duplicated structures\n",
    "len({m1, m2, m7, m11}) == 3 # create set of unique molecules. Only 3 of them were different."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.2. Reaction signatures\n",
    "ReactionContainer have its signature. Signature is SMIRKS string in which molecules of reactants, reagents, products presented in canonical order.\n",
    "\n",
    "API is the same as for molecules\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "str(r1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3. CGR signature\n",
    "CGRContainer have its signature. Signatures is SMIRKS-like strings where atoms in reactants and products has same order and split by >> symbol"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "str(cgr2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If one align left- and right-hand side of signature, he will see bond order changes.\n",
    "\n",
    "C-C-\\[O-\\].C(=O)(-O)-C(=O)(-O).O-C-C  \n",
    "C-C-O-C(=O)(.O)-C(=O)(.O)-O-C-C"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "cgrtools",
   "language": "python",
   "name": "cgrtools"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
